ggplot
Damian Pavlyshyn
A standard-form data table is a matrix of values where each row corresponds to an observation and each column corresponds to a variable.
In this lecture we will see how to use a data table to gain insight into the variables (corresponding to columns) and how they relate to each other.
This is essentially a definition of data presentation.
We start with a big and unwieldy table of numbers. How do we extract useful information from it?
Try these out on some vectors and data frames:
str, summary
head, tail
names, dim, nrow, ncol
mean, median, sd, var
Each row specifies a graphical element of a plot.
## Classes 'tbl_df', 'tbl' and 'data.frame': 32 obs. of 7 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cylinders : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ weight : num 2620 2875 2320 3215 3440 ...
## $ horsepower : num 110 110 93 110 175 105 245 62 95 123 ...
## $ engine : Factor w/ 2 levels "V-shaped","straight": 1 1 2 2 1 2 1 2 2 2 ...
## $ transmission: Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gears : num 4 4 4 3 3 3 3 4 4 4 ...
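The efficiency table shown above is not defined in these slides. Its columns match the built-in mtcars data with renamed and recoded variables, so the following is a sketch of how a similar table might be rebuilt. The recoding is an assumption inferred from the str output, not the author's code, and the result is a plain data.frame rather than a tibble. A few of the inspection functions listed earlier are then applied to it.

```r
# Hypothetical reconstruction of `efficiency` from the built-in mtcars data.
efficiency <- data.frame(
  mpg          = mtcars$mpg,
  cylinders    = factor(mtcars$cyl),            # levels "4", "6", "8"
  weight       = mtcars$wt * 1000,              # mtcars stores weight in 1000 lbs
  horsepower   = mtcars$hp,
  engine       = factor(mtcars$vs, levels = c(0, 1),
                        labels = c("V-shaped", "straight")),
  transmission = factor(mtcars$am, levels = c(0, 1),
                        labels = c("automatic", "manual")),
  gears        = mtcars$gear
)

str(efficiency)           # compact overview of every column
dim(efficiency)           # number of rows and columns
names(efficiency)         # column names
head(efficiency, 3)       # first three rows
summary(efficiency$mpg)   # five-number summary plus the mean
sd(efficiency$mpg)        # standard deviation of one column
```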
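Ingredients of a plot:
Data and aesthetic mappings: set in ggplot with the aes (aesthetic) function
Geometric objects: functions with the geom_ prefix
Coordinate systems: functions with the coord_ prefix
ggplot(data = efficiency, aes(x = weight, y = mpg, color = transmission)) +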
geom_point(size = 3) +
ggtitle("Fuel efficiency vs vehicle weight")
ggplot(data = efficiency, aes(x = horsepower)) +
geom_histogram(bins = 10) +
geom_freqpoly(aes(color = engine), bins = 10)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Notice that the number of cylinders (cyl) is stored as a number, not a factor, so it is treated as a continuous variable.
But the only cylinder numbers are 4, 6 and 8, so we probably want to treat them as discrete. After all, the above graphic has a color designation for 3.56 cylinders, which isn’t at all useful!
Once the number of cylinders is converted to the factor type, R knows to treat it as a discrete variable and the resulting plot makes much more sense!
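The code behind these two plots is not included in the slides. A minimal sketch of the conversion, assuming the plot in question colors a weight vs. mpg scatterplot by mtcars$cyl (the choice of axes and the names cars, p_continuous and p_discrete are illustrative assumptions):

```r
library(ggplot2)

cars <- mtcars
is.numeric(cars$cyl)   # cyl is stored as a number

# Numeric cyl: ggplot2 uses a continuous color gradient
p_continuous <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3)

# Convert cyl to a factor: ggplot2 now assigns one color per level
cars$cyl <- factor(cars$cyl)
levels(cars$cyl)

p_discrete <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3)
```

Printing p_continuous and p_discrete side by side shows the difference in the legend: a color bar in the first case, three separate keys in the second.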
What is the distribution of cylinders in my dataset?
ggplot(data = efficiency, aes(x = cylinders)) +
geom_bar() +
ggtitle("Count by cylinders") +
xlab("No. of cylinders")
What is the distribution of miles per gallon in my dataset?
ggplot(data = efficiency, aes(x = mpg)) +
geom_histogram() +
ggtitle("Histogram of miles per gallon")
Not ideal: too many bins, which defeats the purpose of a histogram. We can manually specify the bins using the breaks option.
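When neither bins nor breaks is given, geom_histogram falls back to 30 bins, which is a lot for 32 cars. The sketch below shows what a breaks vector amounts to, using base R only. It assumes the mpg column is the one from the built-in mtcars data, and the names breaks, n_bins and bin_counts are illustrative.

```r
# Bin edges from 10 to 35 in steps of 5
breaks <- seq(10, 35, 5)
breaks

# Six edges define five bins
n_bins <- length(breaks) - 1
n_bins

# Count how many mpg values fall into each bin
bin_counts <- table(cut(mtcars$mpg, breaks = breaks))
bin_counts
```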
ggplot(data = efficiency, aes(x = mpg)) +
geom_histogram(breaks = seq(10, 35, 5)) +
ggtitle("Histogram of miles per gallon")
What is the relationship between mpg and weight?
ggplot(data = efficiency, aes(y = mpg, x = weight)) +
geom_point(size = 2) +
ggtitle("Miles per gallon vs. weight")
What is the relationship between mpg and time?
We will plot the yearly mean mpg against the year. To create the corresponding table, we use the following code, which we will explain in later lectures.
library(fueleconomy)
library(dplyr)  # provides %>%, group_by() and summarize()
data(vehicles)
vehicles <- vehicles %>%
  group_by(year) %>%
  summarize(`mean highway mpg` = mean(hwy))
head(vehicles)
## # A tibble: 6 x 2
## year `mean highway mpg`
## <dbl> <dbl>
## 1 1984 19.1
## 2 1985 23.0
## 3 1986 22.7
## 4 1987 22.4
## 5 1988 22.7
## 6 1989 22.5
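The idea behind this table is to split the rows by year and average hwy within each group. As a rough illustration of that step, here is the same computation in base R on a made-up toy table (the data frame toy and its values are hypothetical, not taken from fueleconomy):

```r
# Hypothetical toy data: two vehicles per year
toy <- data.frame(
  year = c(1984, 1984, 1985, 1985),
  hwy  = c(18, 20, 22, 24)
)

# Group by year and take the mean of hwy within each group
yearly <- aggregate(hwy ~ year, data = toy, FUN = mean)
yearly
```

The result has one row per year, which is the shape needed to plot a yearly mean against the year.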
Now, we make our usual scatterplot:
ggplot(data = vehicles, aes(y = `mean highway mpg`, x = year)) +
geom_point() +
ggtitle("Mean highway mpg by year")
Hmmm, not so good…
Let’s replace geom_point with geom_line:
ggplot(data = vehicles, aes(y = `mean highway mpg`, x = year)) +
geom_line() +
ggtitle("Mean highway mpg by year")
For each value of cylinder, what is the distribution of mpg like?
p <- ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
ggtitle("Distribution of mpg by cylinders")
We can store parts of a plot as a variable and re-use it with different layers:
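The layers that were added to the stored plot are not shown in the slides. One possible sketch, using mtcars as a stand-in for efficiency (an assumption) and three geoms that each display a distribution within groups:

```r
library(ggplot2)

# Base plot without any layer; mtcars stands in for `efficiency`
p <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  ggtitle("Distribution of mpg by cylinders") +
  xlab("No. of cylinders")

# The same base plot combined with three different layers
p_box    <- p + geom_boxplot()             # median, quartiles, outliers
p_violin <- p + geom_violin()              # smoothed density per group
p_points <- p + geom_jitter(width = 0.1)   # raw points, slightly spread out

# Typing p_box (or print(p_box)) at the console draws the plot
```

Storing the base plot keeps the data, aesthetics and titles in one place, so only the geom changes between versions.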
p <- ggplot(data = efficiency, aes(x = cylinders, fill = engine)) +
ggtitle("Count by cylinders") +
xlab("No. of cylinders")
In a bar plot, we have different ways of arranging the bars:
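The arrangements themselves are missing from the extracted slides. In ggplot2 they are controlled by the position argument of geom_bar; the sketch below shows three common values, again with mtcars standing in for efficiency (an assumption, as are the variable names).

```r
library(ggplot2)

# Base plot: bars by cylinder count, filled by engine type (vs in mtcars)
p <- ggplot(data = mtcars, aes(x = factor(cyl), fill = factor(vs))) +
  ggtitle("Count by cylinders") +
  xlab("No. of cylinders")

p_stack <- p + geom_bar(position = "stack")   # segments stacked (the default)
p_dodge <- p + geom_bar(position = "dodge")   # segments side by side
p_fill  <- p + geom_bar(position = "fill")    # stacked and scaled to proportions
```

Stacking emphasizes the total count per category, dodging makes the subgroups easier to compare, and filling shows the share of each subgroup.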
These aesthetics are shared by many different geoms and so are good to know off the top of your head:
x, y: coordinates
color: (out)line color, fill: fill color
size: point size or (out)line width, shape: shape of points (circle, x, square etc…)
linetype: line specification (solid, dashed, dotted, etc.)
alpha: transparency
group: which points to link together with lines
Some geoms have special aesthetics. These are usually documented in the help file for the corresponding geom.
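As a small illustration of several of these aesthetics at once, the sketch below maps color and shape to columns inside aes and sets size and alpha to fixed values outside it. It uses mtcars directly, and the name p_aes is illustrative.

```r
library(ggplot2)

p_aes <- ggplot(mtcars, aes(x = wt, y = mpg,
                            color = factor(cyl),    # mapped: varies with the data
                            shape = factor(am))) +  # mapped: varies with the data
  geom_point(size = 3, alpha = 0.7)                 # fixed: same for every point
```

An aesthetic placed inside aes is driven by a column and gets a legend; the same aesthetic given as a plain argument to the geom applies one constant value to all points.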
We’ve gone over many of these in the previous slides, but they’re assembled in this list for reference:
geom_point(): Points on a scatter plot. Requires x and y aesthetics.
geom_line(): Points connected by a line in order of increasing x coordinate. Requires x and y aesthetics. geom_path() is similar, but connects the points in the order that they appear in the data frame, which is useful for drawing curves that are not functions of x.
geom_histogram(): Histogram of the values in the column specified by x. geom_freqpoly() is a similar geom that is essentially just the outline of a histogram and is useful when you want to overlay several histograms.
geom_bar(): Bar chart indicating the number of observations in each of the categories specified by x. If you supply a y aesthetic and pass the argument stat = "identity", the y aesthetic will specify the height of each bar.
geom_polygon(): Shape with vertices specified by x-y coordinates. Make sure to include a group aesthetic to specify which polygon each observation is part of. This is useful for drawing maps.
rgb(0,0,1), rgb(1,0,0), rgb(0,0,0), rgb(1,1,1)
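The rgb calls above build colors from red, green and blue intensities, each between 0 and 1 by default, and return a hex string that can be passed to a color or fill argument. A short sketch in base R (the variable names are illustrative):

```r
blue  <- rgb(0, 0, 1)   # no red, no green, full blue
red   <- rgb(1, 0, 0)   # full red only
black <- rgb(0, 0, 0)   # all channels off
white <- rgb(1, 1, 1)   # all channels at full intensity

c(blue, red, black, white)
# "#0000FF" "#FF0000" "#000000" "#FFFFFF"
```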